Skip to content

Conversation

@roomote
Copy link
Contributor

@roomote roomote bot commented Aug 10, 2025

Changes:\n- Add waitForServerReady to guard against race conditions on initial connect.\n- Add transient error detection (-32000/connection closed) and single retry for tools/call and resources/read.\n- Update callTool and readResource to use readiness + retry.\n- Add new tests (McpHub.retry.spec.ts) covering readiness wait and transient retry.\n- All MCP tests pass (McpHub.spec.ts, useMcpToolTool.spec.ts, and new retry spec).\n\nThis improves robustness when an MCP server has just started or after install, reducing flaky 'Connection closed' errors on the first request.


Important

Enhances McpHub with server readiness checks and transient error retries, adding tests for robustness.

  • Behavior:
    • Adds waitForServerReady in McpHub.ts to ensure server readiness before initial requests.
    • Implements transient error detection and retry for -32000/connection closed errors in callTool and readResource.
  • Functions:
    • Updates callTool and readResource to use readiness check and retry logic.
    • Adds isTransientClosedError to identify transient errors.
  • Tests:
    • New tests in McpHub.retry.spec.ts for readiness wait and transient retry.
    • Ensures callTool and readResource retry once on transient errors.
    • Verifies server readiness wait before first tool call.
  • Misc:
    • All MCP tests pass, including McpHub.spec.ts and useMcpToolTool.spec.ts.

This description was created by Ellipsis for 69f66b6. You can customize this summary. It will automatically update as commits are pushed.

@roomote roomote bot requested review from cte, jr and mrubens as code owners August 10, 2025 05:22
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Aug 10, 2025
Copy link
Contributor Author

@roomote roomote bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing my own code is like debugging in a mirror - everything looks backwards but the bugs are still mine.

timeoutMs: number = 10000,
): Promise<ConnectedMcpConnection> {
// Immediately error if there's no connected-type connection to wait on (preserves legacy error for tests/UX)
const initial = this.findConnection(serverName, source)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional? The connection check happens outside the loop, which could lead to a race condition if the connection is removed between the initial check and entering the loop. Consider moving the findConnection call inside the loop for safer operation.

} catch (error) {
// Handle initial transient "Connection closed" on first invocation after install/start
if (this.isTransientClosedError(error)) {
await delay(200)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this retry delay configurable or at least define it as a constant? The magic number 200ms appears in both callTool and readResource. Something like: private static readonly TRANSIENT_ERROR_RETRY_DELAY_MS = 200;

const msg = error instanceof Error ? error.message : String(error)
if (typeof code === "number" && code === -32000) return true
if (typeof code === "string" && /-?32000/.test(code)) return true
return /connection closed|closed before|socket hang up|ECONNRESET|EPIPE|ERR_STREAM_PREMATURE_CLOSE/i.test(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex pattern is quite broad and might catch unrelated errors. Could we be more specific about which error patterns should trigger a retry? For example, we might want to exclude certain EPIPE errors that indicate permanent failures rather than transient issues.

mkdir: vi.fn().mockResolvedValue(undefined),
}))

describe("McpHub transient-retry and readiness", () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great test coverage for the happy path! Consider adding edge case tests: What happens when waitForServerReady times out? What happens on the second retry failure (should not retry a third time)? Error handling when connection is removed during the wait period. These would help ensure the retry logic is robust under various failure scenarios.

@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Aug 10, 2025
@daniel-lxs
Copy link
Member

Closing for now since this doesn't have an issue that I can repro and test against

@daniel-lxs daniel-lxs closed this Aug 12, 2025
@github-project-automation github-project-automation bot moved this from Triage to Done in Roo Code Roadmap Aug 12, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Aug 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. size:L This PR changes 100-499 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants